Section: New Results
High performance tensor–vector multiplication on shared-memory systems
Tensor–vector multiplication is one of the core components of tensor computations. We recently investigated high-performance, single-core implementations of this bandwidth-bound operation. Here, we investigate efficient shared-memory implementations. After carefully analyzing the design space, we implement a number of alternatives using OpenMP and compare them experimentally. Experimental results on systems with up to 8 sockets show near-peak performance for the proposed algorithms.
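To make the operation concrete, the following is a minimal sketch of one possible shared-memory variant. It is not taken from the paper. It assumes a dense tensor stored in row-major order and viewed as a three-dimensional array of shape left × n_k × right, where left and right are the products of the dimensions before and after the multiplied mode. The function name `tvm_mode_k` and its parameters are hypothetical and chosen for illustration only.

```c
#include <stddef.h>

/*
 * Illustrative mode-k tensor-vector multiplication (hypothetical helper,
 * not the implementation evaluated in the paper).
 *
 * The dense, row-major tensor is viewed as X[left][n_k][right] and the
 * function computes Y[l][r] = sum over i of X[l][i][r] * v[i].
 */
void tvm_mode_k(const double *x, const double *v, double *y,
                size_t left, size_t n_k, size_t right)
{
    /* Each l-slice writes a disjoint part of y, so threads need no
       synchronization. Without OpenMP the pragma is ignored and the
       loop runs sequentially. */
    #pragma omp parallel for schedule(static)
    for (ptrdiff_t l = 0; l < (ptrdiff_t)left; ++l) {
        const double *xs = x + (size_t)l * n_k * right;
        double *ys = y + (size_t)l * right;

        for (size_t r = 0; r < right; ++r)
            ys[r] = 0.0;

        for (size_t i = 0; i < n_k; ++i) {
            const double vi = v[i];
            const double *row = xs + i * right;
            /* Unit-stride pass over the tensor: every element is read
               once, which is why the operation is bandwidth-bound. */
            for (size_t r = 0; r < right; ++r)
                ys[r] += row[r] * vi;
        }
    }
}
```

This sketch parallelizes only over the leading index, so it exposes no parallelism when left is 1 (multiplication along the first mode). Choosing which loops to parallelize for each mode and layout is presumably one of the design-space questions the analysis above refers to.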
This work appears in the proceedings of PPAM 2019 and is accompanied by a technical report [22], [36].